Abstract: Self-normalized processes are basic to many probabilistic and
statistical studies. They arise naturally in the study of stochastic integrals,
martingale inequalities and limit theorems, likelihood-based methods
in hypothesis testing and parameter estimation, and Studentized pivots and 
bootstrap-t methods for confidence intervals. In contrast to standard nor- 
malization, large values of the observations play a lesser role as they appear 
both in the numerator and its self-normalized denominator, thereby mak- 
ing the process scale invariant and contributing to its robustness. Herein we 
survey a number of results for self-normalized processes in the case of de- 
pendent variables and describe a key method called "pseudo-maximization" 
that has been used to derive these results. In the multivariate case, self- 
normalization consists of multiplying by the inverse of a positive definite 
matrix (instead of dividing by a positive random variable as in the scalar 
case) and is ubiquitous in statistical applications, examples of which are 
given. 
In Memory of Joseph L. Doob

1. Introduction 

This paper presents an introduction to the theory and applications of self-normalized processes in dependent variables, which was relatively unexplored until recently due to difficulties caused by the highly non-linear nature of self-normalization. We overcome these difficulties by using the method of mixtures, which provides a tool for "pseudo-maximization".

We dedicate this paper to the memory of J. L. Doob, a great probabilist
who generously pointed one of us in this general direction, though we each 
had our independent initial path. While the first author was visiting the Uni- 
versity of Illinois at Urbana-Champaign in the early 1990's, he took frequent 
hiking trips with Doob. Upon reaching a mountain-top in one of these trips, 
he asked Doob what were the most important open problems in probability. 
Doob replied that there were results in harmonic analysis involving harmonic 
functions divided by subharmonic or superharmonic functions that did not yet 
have analogues in the probabilistic setting of martingales. Guided by Doob's answer, de la Peña (1999) developed a new technique for obtaining exponential bounds for martingales. Subsequently, de la Peña, Klass and Lai (2000, 2004) introduced another method, which we call pseudo-maximization, to derive exponential and $L_p$-bounds for self-normalized processes. A universal LIL was also obtained via separate, newly developed methods. In this survey we review
these results as well as those by others that fall (broadly) under the rubric 
of self-normalization. Our choice of topics has been guided by the usefulness 
and definitiveness of the results, and the light they shed on various aspects of 
probabilistic/statistical theory. 

Self-normalized processes arise naturally in statistical applications. In standard form (as when connected to CLT's) they are unit-free and often permit the weakening or even the elimination of moment assumptions. The prototypical example of a self-normalized random process is Student's $t$-statistic. It replaces the population standard deviation $\sigma$ in $\sqrt{n}(\bar X_n - \mu)/\sigma$ by the sample standard deviation. Let $\{X_i\}$ be i.i.d. normal $N(\mu, \sigma^2)$,

$$\bar X_n = \frac{\sum_{i=1}^n X_i}{n}, \qquad s_n^2 = \frac{\sum_{i=1}^n (X_i - \bar X_n)^2}{n-1}.$$

Then $T_n = (\bar X_n - \mu)/(s_n/\sqrt{n})$ has the $t_{n-1}$-distribution. Let $Y_i = X_i - \mu$, $A_n = \sum_{i=1}^n Y_i$ and $B_n^2 = \sum_{i=1}^n Y_i^2$. We can re-express $T_n$ in terms of $A_n/B_n$ as

$$T_n = \frac{A_n/B_n}{\sqrt{\{n - (A_n/B_n)^2\}/(n-1)}}.$$

Thus, information on the properties of $T_n$ can be derived from the self-normalized process

$$\frac{A_n}{B_n} = \frac{\sum_{i=1}^n Y_i}{\sqrt{\sum_{i=1}^n Y_i^2}}. \qquad (1.1)$$

Hence, in a more general context, a self-normalized process assumes the form $A_n/B_n$, or $A_t/B_t$ in continuous time, where $B_t$ is a random variable that is used to estimate a dispersion measure of the process $A_t$.
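The identity between $T_n$ and the ratio $A_n/B_n$ can be checked numerically. The following is a minimal Python sketch, not part of the original development; the function names and the sample values are illustrative assumptions.

```python
import math


def t_statistic(xs, mu):
    """Student's t-statistic from the sample mean and sample standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)  # sample variance s_n^2
    return math.sqrt(n) * (mean - mu) / math.sqrt(s2)


def t_from_self_normalized(xs, mu):
    """The same statistic recovered from the self-normalized ratio A_n / B_n."""
    n = len(xs)
    ys = [x - mu for x in xs]                  # Y_i = X_i - mu
    a_n = sum(ys)                              # A_n
    b_n = math.sqrt(sum(y * y for y in ys))    # B_n
    r = a_n / b_n
    return r / math.sqrt((n - r * r) / (n - 1))


sample = [2.1, 1.4, 3.3, 2.8, 0.9, 2.2]  # illustrative data, not from the paper
print(t_statistic(sample, 2.0), t_from_self_normalized(sample, 2.0))
```

The two printed values agree up to floating-point rounding, which is the sense in which properties of $T_n$ reduce to properties of $A_n/B_n$.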

Although the methodology of self-normalization dates back to 1908 when William Gosset (aka Student) introduced Student's $t$-statistic, active development of the probability theory of self-normalized processes began in the 1990's after the seminal work of Griffin and Kuelbs (1991) on laws of the iterated logarithm (LIL) for self-normalized sums of i.i.d. variables belonging to the domain of attraction of a normal or stable law. In particular, Bentkus and Götze derived a Berry-Esseen bound for Student's $t$-statistic, and Giné, Götze and Mason (1997) proved that the $t$-statistic has a limiting standard normal distribution if and only if $X_1$ is in the domain of attraction of a normal law, by making use of exponential and $L_p$ bounds for $A_n/B_n$, where $A_n = \sum_{i=1}^n X_i$ and $B_n^2 = \sum_{i=1}^n X_i^2$. The limiting distribution of the $t$-statistic when $X_1$ belongs to the domain of attraction of a stable law had been studied earlier by Efron (1969) and Logan, Mallows, Rice and Shepp. Whereas Giné, Götze and Mason's result settles one of the conjectures of Logan, Mallows, Rice and Shepp that the self-normalized sum "is asymptotically normal if (and perhaps only if) $X_1$ is in the domain of attraction of the normal law," Chistyakov and Götze have settled the other conjecture that the "only possible nontrivial limiting distributions" are those when $X_1$ follows a stable law. Shao (1997) proved large deviation results for $\sum_{i=1}^n X_i/\sqrt{\sum_{i=1}^n X_i^2}$ without moment conditions and moderate deviation results when $X_1$ is in the domain of attraction of a normal or stable law. Subsequently Shao (1997) obtained Cramér-type large deviation results when $E|X_1|^3 < \infty$.



Jing, Shao and Zhou (2003) derived saddlepoint approximations for Student's $t$-statistic with no moment assumptions. Bercu, Gassiat and Rio (2002) obtained large and moderate deviation results for self-normalized empirical processes. Self-normalized sums of independent but non-identically distributed $X_i$ have been considered by Bentkus, Bloznelis and Götze (1996), Wang and Jing (1999) and Jing, Shao and Wang (2002). Chen (1999), Worms (2000) and Bercu (2001) have provided extensions to self-normalized sums of functions of ergodic Markov chains and autoregressive models. Giné and Mason relate the LIL of self-normalized sums of i.i.d. random variables $X_i$ to the stochastic boundedness of $(\sum_{i=1}^n X_i)/(\sum_{i=1}^n X_i^2)^{1/2}$.
Egorov (1998) gives exponential inequalities for a centered variant of (1.1). Along a related line of work, Caballero, Fernández and Nualart (1998) provide moment inequalities for a continuous martingale divided by its quadratic variation, and use these results to show that if $\{M_t, t \ge 0\}$ is a continuous martingale, null at zero, then for every $1 \le p < q$, there exists a universal constant $C = C(p, q)$ such that

$$\Big\|\frac{M_t}{\langle M\rangle_t}\Big\|_p \le C\,\Big\|\frac{1}{\langle M\rangle_t^{1/2}}\Big\|_q. \qquad (1.2)$$

Related work in Revuz and Yor (1999, page 167) for continuous local martingales yields for all $p > q > 0$ the existence of a constant $C_{pq}$ such that

$$E\left[\frac{(\sup_{s<\infty}|M_s|)^p}{\langle M\rangle_\infty^{q/2}}\right] \le C_{pq}\,E\Big(\sup_{s<\infty}|M_s|\Big)^{p-q}. \qquad (1.3)$$



In Section 2 we describe the approaches of de la Peña (1999) to exponential inequalities for strong-law norming and of de la Peña, Klass and Lai (2000, 2004) to exponential inequalities and $L_p$-bounds of self-normalized processes. Section 3 considers laws of the iterated logarithm for self-normalized martingales. Section 4 concludes with a discussion and review of self-normalization and pseudo-maximization in statistical applications.

2. Self-normalization and pseudo-maximization

2.1. Motivation

"BEHIND EVERY LIMIT THEOREM 
THERE IS AN INEQUALITY" 

This folklore has been attributed to Kolmogorov. 

Example 2.1. Let $S_n/a_n$ be a sequence of random variables. Then, to show that $S_n/a_n \to \mu$ in probability, Markov's inequality is often used:

$$P(|S_n/a_n - \mu| > \epsilon) \le E|S_n/a_n - \mu|^p/\epsilon^p.$$

The weak law of large numbers for sums of i.i.d. random variables with finite variance uses the case $p = 2$ and the fact that the variance of the sum is the sum of the variances. What happens when the variance is infinite, and when $a_n$ depends on $(X_1, X_2, \ldots, X_n)$?

Example 2.2 (Almost Sure Growth Rate of a Sum). For non-decreasing $a_n$, if we can show for some $K > 0$ and for all $\epsilon > 0$ that

$$P\{S_n/a_n > (1 + 3\epsilon)K \text{ i.o.}\} = 0,$$

then we can conclude that $\limsup S_n/a_n \le K$ a.s. By the Borel-Cantelli lemma, it suffices to show that for some $1 < n_1 < n_2 < \cdots$ with $a_j \le (1 + \epsilon)a_{n_k}$ for $n_k \le j < n_{k+1}$ whenever $k$ is sufficiently large,

$$\sum_{k=1}^\infty P\Big\{\max_{1 \le j \le n_{k+1}} S_j > (1 + \epsilon)Ka_{n_k}\Big\} < \infty.$$

Problems of this type frequently reduce to the use of Markov's inequality and finding appropriate bounds on $E\exp(\lambda S_n/a_n)$ to show that the above series converges. We are particularly interested in situations in which $a_n$ depends on the available data and hence is random. This further motivates the study of self-normalized sums.

Example 2.3. Consider the autoregressive model 



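$$Y_n = \alpha Y_{n-1} + \epsilon_n, \qquad Y_0 = 0,$$

where $\alpha$ is a fixed (unknown) parameter and $\epsilon_n$ are independent standard normal random variables. The maximum likelihood estimate (MLE) $\hat\alpha_n$ of $\alpha$ is the maximizer of the log likelihood function

$$\log f_n(Y_1, \ldots, Y_n) = -\sum_{j=1}^n (Y_j - \alpha Y_{j-1})^2/2 - n\log(\sqrt{2\pi}).$$

Differentiating with respect to $\alpha$ and equating to zero, we obtain

$$\hat\alpha_n = \frac{\sum_{j=1}^n Y_{j-1}Y_j}{\sum_{j=1}^n Y_{j-1}^2}.$$

Therefore, $\hat\alpha_n - \alpha$ can be expressed as a self-normalized random variable:

$$\hat\alpha_n - \alpha = \frac{\sum_{j=1}^n Y_{j-1}\epsilon_j}{\sum_{j=1}^n Y_{j-1}^2}. \qquad (2.1)$$

In (2.1), the numerator $A_n := \sum_{j=1}^n Y_{j-1}\epsilon_j$ is a martingale with respect to the filtration $\mathcal{F}_n := \sigma(Y_1, \ldots, Y_n; \epsilon_1, \ldots, \epsilon_n)$. The denominator of (2.1) is

$$B_n^2 := \sum_{j=1}^n E[(Y_{j-1}\epsilon_j)^2 \mid \mathcal{F}_{j-1}] = \sum_{j=1}^n Y_{j-1}^2,$$

which is the conditional variance of $A_n$. Therefore, $\hat\alpha_n - \alpha = A_n/B_n^2$ is a process self-normalized by the conditional variance. Since the $\epsilon_j$ are $N(0, 1)$, for all $n \ge 1$ and $-\infty < \lambda < \infty$,

$$M_n := \exp\Big\{\lambda\sum_{j=1}^n Y_{j-1}\epsilon_j - \frac{\lambda^2}{2}\sum_{j=1}^n Y_{j-1}^2\Big\}$$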

is an exponential martingale and therefore satisfies the canonical assumption 
below, which we use to develop a comprehensive theory for self-normalized pro- 
cesses. 
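The self-normalized representation of the estimation error in Example 2.3 can be illustrated by simulation. The sketch below is only an illustration under stated assumptions: the function names, the seed and the parameter value 0.5 are hypothetical choices, not taken from the paper.

```python
import random


def simulate_ar1(alpha, n, seed=0):
    """Simulate Y_j = alpha * Y_{j-1} + eps_j with Y_0 = 0 and N(0, 1) noise."""
    rng = random.Random(seed)
    ys, eps = [0.0], []          # ys[j] = Y_j, eps[j - 1] = eps_j
    for _ in range(n):
        e = rng.gauss(0.0, 1.0)
        eps.append(e)
        ys.append(alpha * ys[-1] + e)
    return ys, eps


def mle_alpha(ys):
    """MLE of alpha: sum of Y_{j-1} Y_j divided by sum of Y_{j-1}^2."""
    num = sum(ys[j - 1] * ys[j] for j in range(1, len(ys)))
    den = sum(ys[j - 1] ** 2 for j in range(1, len(ys)))
    return num / den


def self_normalized_error(ys, eps):
    """A_n / B_n^2 with A_n = sum Y_{j-1} eps_j and B_n^2 = sum Y_{j-1}^2."""
    a_n = sum(ys[j - 1] * eps[j - 1] for j in range(1, len(ys)))
    b_n2 = sum(ys[j - 1] ** 2 for j in range(1, len(ys)))
    return a_n / b_n2


ys, eps = simulate_ar1(alpha=0.5, n=200)
print(mle_alpha(ys) - 0.5, self_normalized_error(ys, eps))
```

Both printed numbers coincide up to rounding, since substituting $Y_j = \alpha Y_{j-1} + \epsilon_j$ into the MLE gives exactly $A_n/B_n^2$.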



2.2. Canonical assumption and exponential bounds for strong law 

We consider a pair of random variables $A$, $B$, with $B > 0$ such that

$$E\exp\{\lambda A - \lambda^2B^2/2\} \le 1. \qquad (2.2)$$

There are three regimes of interest: when (2.2) holds

(i) for all real $\lambda$,

(ii) for all $\lambda \ge 0$,

(iii) for all $0 \le \lambda < \lambda_0$, where $0 < \lambda_0 < \infty$.

Results are presented in all three cases. Such canonical assumptions imply various moment and exponential bounds, including the following bound connected to the law of large numbers in de la Peña (1999) for case (i).



Theorem 2.1. Under the canonical assumption for all real $\lambda$,

$$P\{A/B^2 > x,\ 1/B^2 \le y\} \le \exp\Big(-\frac{x^2}{2y}\Big) \qquad (2.3)$$

for all $x, y > 0$.

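Proof. The key here is to "keep" the indicator when using Markov's inequality. In fact, for all measurable sets $S$,

$$
\begin{aligned}
P\{A/B^2 > x, S\} &= P\{\exp(A) > \exp(xB^2), S\} \\
&\le \inf_{\lambda > 0} E\big[\exp\{\tfrac{\lambda}{2}A - \tfrac{\lambda}{2}xB^2\}\,1(A/B^2 > x, S)\big] \\
&= \inf_{\lambda > 0} E\big[\exp\{\tfrac{\lambda}{2}A - \tfrac{\lambda^2}{4}B^2 - (\tfrac{\lambda}{2}x - \tfrac{\lambda^2}{4})B^2\}\,1(A/B^2 > x, S)\big] \\
&\le \inf_{\lambda > 0} \big(E\exp\{\lambda A - \tfrac{\lambda^2}{2}B^2\}\big)^{1/2}\big(E[\exp\{-(\lambda x - \tfrac{\lambda^2}{2})B^2\}\,1(A/B^2 > x, S)]\big)^{1/2},
\end{aligned}
$$

by the Cauchy-Schwarz inequality. The first term in the last inequality is bounded by 1, by the canonical assumption. The value minimizing the second term is $\lambda = x$, and therefore

$$P\{A/B^2 > x, S\} \le \big(E[\exp\{-\tfrac{x^2B^2}{2}\}\,1(A/B^2 > x, S)]\big)^{1/2}.$$

Let $S = \{1/B^2 \le y\}$. On $S$ we have $\exp\{-x^2B^2/2\} \le \exp\{-x^2/(2y)\}$, so the right-hand side is at most $(\exp\{-x^2/(2y)\}\,P\{A/B^2 > x, S\})^{1/2}$, and solving for the probability gives the desired result. □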
Applying this bound with $y = 1/z$ to both $A_n$ and $-A_n$ in Example 2.3 yields

$$P\Big(|\hat\alpha_n - \alpha| > x,\ \sum_{j=1}^n Y_{j-1}^2 \ge z\Big) \le 2\exp(-x^2z/2)$$

for all $x, z > 0$. The following variant of Theorem 2.1 generalizes a result of Lipster and Spokoiny (1999).

Theorem 2.2. Under the canonical assumption for all real $\lambda$,

$$P\{|A|/B \ge x,\ b \le B \le bs\} \le 4\sqrt{e}\,x(1 + \log s)\exp\{-x^2/2\},$$

for all $b > 0$, $s \ge 1$ and $x \ge 1$.

Example 2.4 (Martingales and Supermartingales). The Appendix provides
several classes of random variables that satisfy the canonical assumption (2.2)



V. H. de la Pena et al. / Pseudo-maximization and self-normalized processes 178 



in a series of lemmas (A.1-A.8). These lemmas are closely related to martingale theory. Moreover, Lemmas A.2-A.8 are about a supermartingale condition (2.7) that is stronger than (2.2) for the regime $0 \le \lambda < \lambda_0$.



Theorem 2.1 gives an inequality related to the law of large numbers. There is a class of results of this type in Khoshnevisan (1996), who points out that it essentially dates back to McKean (1962). A related reference is Freedman (1975).

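Theorem 2.3. Let $\{M_t, \mathcal{F}_t; 0 \le t \le \infty\}$ be a continuous martingale such that $E\exp\{\gamma M_\infty\} < \infty$ for all $\gamma$. Assume that $M_0 = 0$, that $\mathcal{F}_0$ contains all null sets and that the filtration $\{\mathcal{F}_t, 0 \le t \le \infty\}$ is right-continuous. Then, for any $\alpha, \beta, \lambda > 0$,

$$P\{M_\infty \ge (\alpha + \beta\langle M\rangle_\infty)\lambda\} \le \exp(-2\alpha\beta\lambda^2). \qquad (2.4)$$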



Khoshnevisan (1996) has also shown that if one assumes that the local modulus of continuity of $M_t$ is in some sense deterministic, then the inequality (2.4) can be reversed (up to a multiplicative constant). As applications of this result, he presents some large deviation results for such martingales. A related paper of Dembo (1996) provides moderate deviations for martingales with bounded jumps. Concerning moderate deviations, a general context for extending results like Theorem 2.3 to the case of discrete-time martingales can be found in de la Peña (1999), who provides a decoupling method for obtaining sharp extensions of exponential inequalities for martingales to the quotient of a martingale divided by its quadratic variation. In what follows we present three related results from de la Peña (1999). The first result is for martingales in continuous time and the last two involve discrete-time processes. The third result is a sharp extension of Bernstein's and Bennett's inequalities. (See also de la Peña and Giné (1999) for details.)



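Theorem 2.4. Let $\{M_t, \mathcal{F}_t, t \ge 0\}$, $M_0 = 0$, be a continuous martingale for which $\exp\{\lambda M_t - \lambda^2\langle M\rangle_t/2\}$, $t \ge 0$, is a supermartingale for all $\lambda > 0$. Let $A$ be a set measurable with respect to $\mathcal{F}_\infty$. Then, for all $0 < t < \infty$, $\beta > 0$, $\alpha, x \ge 0$,

$$P(M_t \ge (\alpha + \beta\langle M\rangle_t)x, A) \le E\Big[\exp\Big\{-x^2\Big(\frac{\beta^2}{2}\langle M\rangle_t + \alpha\beta\Big)\Big\} \,\Big|\, M_t \ge (\alpha + \beta\langle M\rangle_t)x, A\Big],$$

and hence

$$P\Big\{M_t \ge (\alpha + \beta\langle M\rangle_t)x,\ \frac{1}{\langle M\rangle_t} \le y \text{ for some } t < \infty\Big\} \le \exp\Big\{-x^2\Big(\frac{\beta^2}{2y} + \alpha\beta\Big)\Big\}.$$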

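Theorem 2.5. Let $\{M_n = \sum_{i=1}^n d_i, \mathcal{F}_n, n \ge 0\}$ be a sum of conditionally symmetric random variables ($\mathcal{L}(d_i \mid \mathcal{F}_{i-1}) = \mathcal{L}(-d_i \mid \mathcal{F}_{i-1})$ for all $i$). Let $A$ be a set measurable with respect to $\mathcal{F}_\infty$. Then, for all $n \ge 1$, $\beta > 0$, $\alpha, x \ge 0$,

$$P\Big(M_n \ge \Big(\alpha + \beta\sum_{i=1}^n d_i^2\Big)x, A\Big) \le E\Big[\exp\Big\{-x^2\Big(\frac{\beta^2}{2}\sum_{i=1}^n d_i^2 + \alpha\beta\Big)\Big\} \,\Big|\, M_n \ge \Big(\alpha + \beta\sum_{i=1}^n d_i^2\Big)x, A\Big],$$

and hence

$$P\Big\{M_n \ge \Big(\alpha + \beta\sum_{i=1}^n d_i^2\Big)x,\ \frac{1}{\sum_{i=1}^n d_i^2} \le y \text{ for some } n \ge 1\Big\} \le \exp\Big\{-x^2\Big(\frac{\beta^2}{2y} + \alpha\beta\Big)\Big\}.$$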

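Theorem 2.6. Let $\{M_n = \sum_{i=1}^n d_i, \mathcal{F}_n, n \ge 1\}$ be a martingale with $E(d_j^2 \mid \mathcal{F}_{j-1}) = \sigma_j^2 < \infty$ and let $V_n^2 = \sum_{j=1}^n \sigma_j^2$. Furthermore assume that $E(|d_j|^k \mid \mathcal{F}_{j-1}) \le (k!/2)\sigma_j^2c^{k-2}$ a.s. for all $k > 2$ and some $c > 0$. Then, for all $\mathcal{F}_\infty$-measurable sets $A$, $x > 0$,

$$P\Big(\frac{M_n}{V_n^2} > x, A\Big) \le E\Big[\exp\Big\{-\frac{x^2}{1 + cx + \sqrt{1 + 2cx}}\,V_n^2\Big\} \,\Big|\, \frac{M_n}{V_n^2} > x, A\Big],$$

and hence

$$P\Big\{\frac{M_n}{V_n^2} > x,\ \frac{1}{V_n^2} \le y \text{ for some } n\Big\} \le \exp\Big\{-\frac{1}{y}\,\frac{x^2}{1 + cx + \sqrt{1 + 2cx}}\Big\} \le \exp\Big\{-\frac{x^2}{2y(1 + cx)}\Big\}.$$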



2.3. Pseudo-maximization (Method of Mixtures)

Note that if the integrand $\exp\{\lambda A - \lambda^2B^2/2\}$ in (2.2) can be maximized over $\lambda$ inside the expectation (as can be done if $A/B^2$ is non-random), taking $\lambda = A/B^2$ would yield $E\exp(A^2/(2B^2)) \le 1$. This in turn would give the optimal Chebyshev-type bound $P(A/B > x) \le \exp(-x^2/2)$. Since $A/B^2$ cannot (in general) be taken to be non-random, we need to find an alternative method for dealing with this maximization. One approach for attaining a similar effect involves integrating over a probability measure $F$, and using Fubini's theorem to interchange the order of integration with respect to $P$ and $F$. To be effective for all possible pairs $(A, B)$, the $F$ chosen would need to be as uniform as possible so as to include the maximum value of $\exp\{\lambda A - \lambda^2B^2/2\}$ regardless of where it might occur. Thereby some mass is certain to be assigned to and near the random value $\lambda = A/B^2$ which maximizes $\lambda A - \lambda^2B^2/2$. Since all uniform measures are multiples of Lebesgue measure (which is infinite), we construct a finite measure (or a sequence of finite measures) which tapers off to zero at $\lambda = \infty$ as slowly as we can manage. This approach will be used in what follows to provide exponential and moment inequalities for $A/B$, $A/\sqrt{B^2 + (EB)^2}$ and $A/\{B\sqrt{\log\log(B \vee e^2)}\}$. We begin with the second case, where the proof is more transparent. The approach, pioneered by Robbins and Siegmund (1970) and commonly known as the method of mixtures, was used by de la Peña, Klass and Lai (2004) to prove the following.

Theorem 2.7. Let $A$, $B$ with $B > 0$ be random variables satisfying the canonical assumption (2.2) for all $\lambda \in \mathbb{R}$. Then

$$P\Big(\frac{|A|}{\sqrt{B^2 + (EB)^2}} \ge x\Big) \le \sqrt{2}\exp(-x^2/4) \qquad (2.5)$$

for all $x > 0$.



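Proof. Multiplying both sides of (2.2) by $(2\pi)^{-1/2}y\exp(-\lambda^2y^2/2)$ (with $y > 0$) and integrating over $\lambda$, we obtain by using Fubini's theorem that

$$
\begin{aligned}
1 &\ge \int_{-\infty}^\infty E\,\frac{y}{\sqrt{2\pi}}\exp\Big(\lambda A - \frac{\lambda^2}{2}B^2\Big)\exp\Big(-\frac{\lambda^2y^2}{2}\Big)\,d\lambda \\
&= E\Big[\frac{y}{\sqrt{B^2 + y^2}}\exp\Big\{\frac{A^2}{2(B^2 + y^2)}\Big\}\int_{-\infty}^\infty \frac{\sqrt{B^2 + y^2}}{\sqrt{2\pi}}\exp\Big\{-\frac{B^2 + y^2}{2}\Big(\lambda^2 - \frac{2A}{B^2 + y^2}\lambda + \frac{A^2}{(B^2 + y^2)^2}\Big)\Big\}\,d\lambda\Big] \\
&= E\Big[\frac{y}{\sqrt{B^2 + y^2}}\exp\Big\{\frac{A^2}{2(B^2 + y^2)}\Big\}\Big]. \qquad (2.6)
\end{aligned}
$$

By the Cauchy-Schwarz inequality and (2.6),

$$E\exp\Big\{\frac{A^2}{4(B^2 + y^2)}\Big\} \le \Big\{\Big(E\,\frac{y\exp\{A^2/(2(B^2 + y^2))\}}{\sqrt{B^2 + y^2}}\Big)\Big(E\sqrt{\frac{B^2 + y^2}{y^2}}\Big)\Big\}^{1/2} \le \Big(E\sqrt{\frac{B^2}{y^2} + 1}\Big)^{1/2}.$$

Since $E\sqrt{B^2/y^2 + 1} \le E(B/y + 1)$, the special case $y = EB$ gives

$$E\exp\big(A^2/[4(B^2 + (EB)^2)]\big) \le \sqrt{2}.$$

Combining Markov's inequality with this yields

$$P\Big\{\frac{|A|}{\sqrt{B^2 + (EB)^2}} \ge x\Big\} = P\Big\{\frac{A^2}{4(B^2 + (EB)^2)} \ge \frac{x^2}{4}\Big\} \le \sqrt{2}\exp(-x^2/4).$$

□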
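The Gaussian mixture step in the proof rests on a closed-form integral: for fixed numbers $a$, $b$ and $y > 0$, integrating $\exp(\lambda a - \lambda^2b^2/2)$ against the $N(0, y^{-2})$ density gives $y(b^2 + y^2)^{-1/2}\exp\{a^2/(2(b^2 + y^2))\}$. The following sketch checks this numerically with a plain trapezoidal rule; the function names, the integration range and the step count are assumptions made for illustration.

```python
import math


def mixture_integral(a, b, y, half_width=10.0, steps=20000):
    """Trapezoidal approximation of the integral over lambda of
    (y / sqrt(2 pi)) * exp(lambda * a - lambda^2 * (b^2 + y^2) / 2)."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        lam = -half_width + i * h
        weight = 0.5 if i in (0, steps) else 1.0   # end points get half weight
        total += weight * math.exp(lam * a - 0.5 * lam * lam * (b * b + y * y))
    return total * h * y / math.sqrt(2.0 * math.pi)


def closed_form(a, b, y):
    """Value obtained by completing the square in lambda."""
    s = b * b + y * y
    return y / math.sqrt(s) * math.exp(a * a / (2.0 * s))


print(mixture_integral(1.5, 2.0, 1.0), closed_form(1.5, 2.0, 1.0))
```

In the proof the same computation is carried out inside the expectation, with the random pair $(A, B)$ in place of the fixed numbers $(a, b)$.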



In what follows we discuss the analysis of certain boundary crossing probabilities by using the method of mixtures, first introduced in Robbins and Siegmund (1970), under the following refinement of the canonical assumption:

$$\{\exp(\lambda A_t - \lambda^2B_t^2/2),\ t \ge 0\} \text{ is a supermartingale with mean} \le 1 \qquad (2.7)$$

for $0 \le \lambda < \lambda_0$, with $A_0 = 0$. We begin by introducing the Robbins-Siegmund (R-S) boundaries; see Lai. Let $F$ be a finite positive measure on $(0, \lambda_0)$ and assume that $F(0, \lambda_0) > 0$. Let $\Psi(u, v) = \int \exp(\lambda u - \lambda^2v/2)\,dF(\lambda)$. Given $c > 0$ and $v > 0$, the equation

$$\Psi(u, v) = c$$

has a unique solution

$$u = \beta_F(v, c).$$

Moreover, $\beta_F(v, c)$ is a concave function of $v$ and

$$\lim_{v \to \infty} \beta_F(v, c)/v = b/2,$$

where

$$b = \sup\Big\{y > 0 : \int_0^y F(d\lambda) = 0\Big\},$$

with sup over the empty set equal to zero. The R-S boundaries $\beta_F(v, c)$ can be used to analyze the boundary crossing probability

$$P\{A_t \ge g(B_t) \text{ for some } t \ge 0\}$$

when $g(B_t) = \beta_F(B_t^2, c)$ for some $F$ and $c > 0$. This probability equals

$$P\{A_t \ge \beta_F(B_t^2, c) \text{ for some } t \ge 0\} = P\{\Psi(A_t, B_t^2) \ge c \text{ for some } t \ge 0\} \le F(0, \lambda_0)/c, \qquad (2.8)$$

applying Doob's inequality to the supermartingale $\Psi(A_t, B_t^2)$, $t \ge 0$.
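To make the definition of an R-S boundary concrete, consider a mixing measure $F$ with finitely many atoms (an assumption made here purely for illustration; the text allows any finite positive measure on $(0, \lambda_0)$). Then $\Psi(u, v)$ is a finite sum that is strictly increasing in $u$, so $\beta_F(v, c)$ can be located by bisection. The names `psi` and `rs_boundary` below are hypothetical.

```python
import math


def psi(u, v, atoms):
    """Psi(u, v) for a discrete measure F given as (lambda, weight) pairs."""
    return sum(w * math.exp(lam * u - 0.5 * lam * lam * v) for lam, w in atoms)


def rs_boundary(v, c, atoms, iterations=200):
    """Solve Psi(u, v) = c for u by bisection (Psi is increasing in u)."""
    lo, hi = -1.0, 1.0
    while psi(lo, v, atoms) > c:    # widen the bracket to the left
        lo *= 2.0
    while psi(hi, v, atoms) < c:    # widen the bracket to the right
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if psi(mid, v, atoms) < c:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


atoms = [(0.5, 1.0), (1.0, 1.0)]    # illustrative F with total mass 2
print(rs_boundary(4.0, 10.0, atoms))
```

For a single atom of weight $w$ at $\lambda$ the equation can be solved by hand, giving $\beta_F(v, c) = \lambda^{-1}\log(c/w) + \lambda v/2$, a line of slope $\lambda/2$; this agrees with the limit $b/2$ stated above, since $b = \lambda$ in that case.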
Example 2.5. Let $\delta > 0$ and

$$dF(\lambda) = \frac{d\lambda}{\lambda(\log(1/\lambda))(\log\log(1/\lambda))^{1+\delta}}$$

for $0 < \lambda < e^{-e}$. Let $\log_2(x) = \log\log(x)$ and $\log_3(x) = \log\log_2(x)$. As shown in Example 4 of Robbins and Siegmund (1970), for fixed $c > 0$,

$$\beta_F(v, c) = \sqrt{2v\big[\log_2 v + (3/2 + \delta)\log_3 v + \log(c/2\sqrt{\pi}) + o(1)\big]},$$

as $v \to \infty$. With this choice of $F$, the probability in (2.8) is bounded by $F(0, e^{-e})/c$ for all $c > 0$. Given $\epsilon > 0$, take $c$ large enough so that $F(0, e^{-e})/c < \epsilon$. Since $\epsilon$ can be arbitrarily small and since for fixed $c$, $\beta_F(v, c) \sim \sqrt{2v\log\log v}$ as $v \to \infty$,

$$\limsup_{t \to \infty} \frac{A_t}{\sqrt{2B_t^2\log\log B_t}} \le 1$$

on the set $\{\lim_{t \to \infty} B_t = \infty\}$.
2.4. Moment inequalities for self-normalized processes

The inequality (1.2) obtained by Caballero, Fernández and Nualart (1998) is used by them to establish the continuity and uniqueness of the solutions of a nonlinear stochastic partial differential equation. A natural question that arises in connection with (1.2) is what happens when the normalization is done by $\sqrt{\langle M\rangle_t}$. The following result from de la Peña, Klass and Lai (2000, 2004) provides an answer to this question.

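Theorem 2.8. Let $A$, $B > 0$ be two random variables satisfying (2.2) for all $\lambda > 0$. Let $\log^+(x) = 0 \vee \log x$. Then for all $p > 0$,

$$E\Big|\frac{A^+}{B}\Big|^p \le c_{1,p} + c_{2,p}\,E\big(\log^+\log(B \vee \tfrac{1}{B})\big)^{p/2}, \qquad (2.9)$$

$$E\Big(\frac{A^+}{B\sqrt{1 \vee \log^+\log(B \vee \tfrac{1}{B})}}\Big)^p \le C_p, \qquad (2.10)$$

for some universal constants $0 < c_{1,p}, c_{2,p}, C_p < \infty$.

The following example shows the optimality properties of Theorem 2.8.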

Example 2.6. Let $\{Y_i\}$, $i = 1, \ldots$ be a sequence of i.i.d. Bernoulli random variables such that $P\{Y_i = 1\} = P\{Y_i = -1\} = 1/2$. Let

$$T = \inf\Big\{n \ge e^e : \sum_{j=1}^n Y_j \ge \sqrt{2n\log\log n}\Big\},$$

with $T = \infty$ if no such $n$ exists. From the delicate LIL by Erdős (1942), it follows that $P(T < \infty) = 1$. Let $d_{n,j} = Y_j1(T \ge j)$ if $1 \le j \le n$ and $d_{n,j} = 0$ if $j > n$. Then almost surely,

$$\frac{d_{n,1} + \cdots + d_{n,n}}{\sqrt{d_{n,1}^2 + \cdots + d_{n,n}^2}} = \frac{\sum_{j=1}^{\min\{T,n\}} Y_j}{\sqrt{\min\{T,n\}}} \to \frac{\sum_{j=1}^T Y_j}{\sqrt{T}} \ge \sqrt{2\log\log T},$$

as $n \to \infty$. Therefore, the second term in (2.9) cannot be removed.

The proof of Theorem 2.8 is based on the following result. Let $L : (0, \infty) \to (0, \infty)$ be a nondecreasing function such that

$$L(cy) \le 3cL(y) \text{ for all } c \ge 1 \text{ and } y > 0, \qquad (2.11a)$$

$$L(y^2) \le 3L(y) \text{ for all } y \ge 1, \qquad (2.11b)$$

$$\int_1^\infty \frac{dx}{xL(x)} = \frac{1}{2}. \qquad (2.11c)$$

Then

$$f(\lambda) = \frac{1}{\lambda L(\max(\lambda, 1/\lambda))}, \qquad \lambda > 0,$$

is a density. An example satisfying (2.11a,b,c) is

$$L(x) = 2(\log xe^e)(\log\log xe^e)^2\,1(x \ge 1).$$

Let $g(x) = x^{-1}\exp\{x^2/2\}\,1(x \ge 1)$. Then

$$E\left[\frac{g(A/B)}{L(A/B) \vee L(B \vee 1/B)}\right] \le \frac{3}{\int_0^1 \exp\{-x^2/2\}\,dx}. \qquad (2.12)$$

Making use of this, de la Peña, Klass and Lai (2000, 2004) extend the inequalities in (2.9) and (2.10) to $Eh(A^+/B)$ and $Eh\big(A^+/\{B\sqrt{1 \vee \log^+\log(B \vee 1/B)}\}\big)$ for nonnegative nondecreasing functions $h(\cdot)$ such that $h(x) \le g^\delta(x)$ for some $0 < \delta < 1$.
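As a check on the normalization in (2.11c), the following short derivation verifies that the example integrates to $1/2$ and that $f$ is then a probability density. It is a sketch that reads the example as $L(x) = 2(\log xe^e)(\log\log xe^e)^2\,1(x \ge 1)$, which is our reading of the display rather than a statement taken from the cited papers.

```latex
\begin{align*}
\int_1^\infty \frac{dx}{xL(x)}
  &= \int_1^\infty \frac{dx}{2x\,\log(xe^e)\,\{\log\log(xe^e)\}^2}
   = \int_e^\infty \frac{du}{2u(\log u)^2}
   \qquad (u = \log x + e) \\
  &= \frac{1}{2}\Big[-\frac{1}{\log u}\Big]_e^\infty = \frac{1}{2}, \\
\int_0^\infty f(\lambda)\,d\lambda
  &= \int_1^\infty \frac{d\lambda}{\lambda L(\lambda)}
   + \int_0^1 \frac{d\lambda}{\lambda L(1/\lambda)}
   = 2\int_1^\infty \frac{dx}{xL(x)} = 1 .
\end{align*}
```

The second line uses the substitution $x = 1/\lambda$ on $(0, 1)$, which maps $d\lambda/\lambda$ to $dx/x$ on $(1, \infty)$.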

2.5. An expectation version of the LIL for self-normalized processes 

We next study the case of self-normalized inequalities when there is no possibility of explosion at the origin by shifting the denominator away from zero. An important result in this direction comes from Graversen and Peskir.



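Theorem 2.9. Let $\{M_t, \mathcal{F}_t, t \ge 0\}$ be a continuous local martingale with quadratic variation process $\langle M\rangle_t$, $t \ge 0$. Let $l(x) = \sqrt{\log(1 + \log(1 + x))}$. Then there exist universal constants $D_1, D_2 > 0$ such that

$$D_1\,E\,l(\langle M\rangle_\tau) \le E\Big(\max_{0 \le t \le \tau}\frac{|M_t|}{\sqrt{1 + \langle M\rangle_t}}\Big) \le D_2\,E\,l(\langle M\rangle_\tau)$$

for all stopping times $\tau$ of $M$.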

The proof of this result was obtained by making use of Lenglart's inequality, the optional sampling theorem and Itô's formula. Shortly after this result appeared, de la Peña, Klass and Lai introduced the moment bounds in the last section, in which the denominator is not shifted away from 0. They then realized that shifted moment bounds can also be obtained for the case in which (2.2) or (2.7) only holds for $0 \le \lambda < \lambda_0$. Subsequently, de la Peña, Klass and Lai (2004) proved part (i) of the following theorem for more general processes than continuous local martingales. Part (ii) of the theorem can be proved by arguments similar to those of Theorem 2 of de la Peña, Klass and Lai (2000).



Theorem 2.10. Let $T = \{0, 1, 2, \ldots\}$ or $T = [0, \infty)$, $1 < q \le 2$, and $\Phi_q(\theta) = \theta^q/q$. Let $A_t$ and $B_t > 0$ be stochastic processes satisfying

$$\{\exp(\lambda A_t - \Phi_q(\lambda B_t)),\ t \in T\} \text{ is a supermartingale with mean} \le 1 \qquad (2.13)$$

for $0 < \lambda < \lambda_0$ and such that $A_0 = 0$ and $B_t$ is nondecreasing in $t > 0$. In the case $T = [0, \infty)$, assume furthermore that $A_t$ and $B_t$ are right-continuous. Let $L : [1, \infty) \to (0, \infty)$ be a nondecreasing function satisfying (2.11a,b,c). Let $\eta > 0$, $\lambda_0\eta > \delta > 0$, and $h : [0, \infty) \to [0, \infty)$ be a nondecreasing function such that $h(x) \le e^{\delta x}$ for all large $x$.

(i) There exists a constant $\kappa$ depending only on $\lambda_0$, $\eta$, $\delta$, $q$, $h$ and $L$ such that

$$Eh\Big(\sup_{t \ge 0}\big\{A_t(B_t \vee \eta)^{-1}[1 \vee \log^+L(B_t \vee \eta)]^{-(q-1)/q}\big\}\Big) \le \kappa. \qquad (2.14)$$

(ii) There exists $\tilde\kappa$ such that for any stopping time $\tau$,

$$Eh\Big(\sup_{0 \le t \le \tau}(B_t \vee \eta)^{-1}A_t\Big) \le \tilde\kappa + Eh\big([\tilde\kappa \vee \log^+L(B_\tau \vee \eta)]^{(q-1)/q}\big). \qquad (2.15)$$



3. Self-normalized LIL for stochastic sequences 

3.1. Stout's LIL: Self-normalizing via conditional variance

A well-known result in martingale theory is Stout's (1970, 1973) LIL that uses the square root of the conditional variance for normalization. The key to the proof of Stout's result is an exponential supermartingale in Lemma A.5 of the Appendix.

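Theorem 3.1. Let $\{d_n, \mathcal{F}_n\}$, $n = 1, \ldots$ be an adapted sequence with $E(d_n \mid \mathcal{F}_{n-1}) \le 0$. Set $M_n = \sum_{i=1}^n d_i$, $\sigma_n^2 = \sum_{i=1}^n E(d_i^2 \mid \mathcal{F}_{i-1})$. Assume that

(i) $d_n \le m_n$ for $\mathcal{F}_{n-1}$-measurable $m_n \ge 0$,

(ii) $\sigma_n^2 < \infty$ a.s. for all $n$,

(iii) $\lim_{n \to \infty} \sigma_n^2 = \infty$ a.s.,

(iv) $\limsup_{n \to \infty} m_n\sqrt{\log\log(\sigma_n^2)}/\sigma_n = 0$ a.s.

Then

$$\limsup \frac{M_n}{\sqrt{2\sigma_n^2\log\log\sigma_n^2}} \le 1 \text{ a.s.}$$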



3.2. Discrete-time martingales: self-normalizing via sums of 
squares 

By making use of Lemma A.8 (see Appendix), de la Peña, Klass and Lai (2004) have proved the following upper LIL for self-normalized and suitably centered sums of random variables, under no assumptions on their joint distributions.

Theorem 3.2. Let $X_n$ be random variables adapted to a filtration $\{\mathcal{F}_n\}$. Let $S_n = X_1 + \cdots + X_n$ and $V_n^2 = X_1^2 + \cdots + X_n^2$. Then, given any $\lambda > 0$, there exist positive constants $a_\lambda$ and $b_\lambda$ such that $\lim_{\lambda \to 0} b_\lambda = \sqrt{2}$ and

$$\limsup \frac{S_n - \sum_{i=1}^n \mu_i[-\lambda v_n, a_\lambda v_n)}{V_n(\log\log V_n)^{1/2}} \le b_\lambda \text{ a.s. on } \{\lim V_n = \infty\}, \qquad (3.1)$$

where $v_n = V_n(\log\log V_n)^{-1/2}$ and $\mu_i[c, d) = E\{X_i1(c \le X_i < d) \mid \mathcal{F}_{i-1}\}$ for $c < d$.
The constant $b_\lambda$ in Theorem 3.2 can be determined as follows. For $\lambda > 0$, let $h(\lambda)$ be the positive solution of $h - \log(1 + h) = \lambda^2$. Let $b_\lambda = h(\lambda)/\lambda$, $\gamma = h(\lambda)/\{1 + h(\lambda)\}$, and $a_\lambda$ is determined via $\lambda/\gamma$. Then $\lim_{\lambda \to 0} b_\lambda = \sqrt{2}$. Let $e_k = \exp(k/\log k)$. The basic idea underlying the proof of Theorem 3.2 pertains to upper-bounding the probability of an event of the form $E_k = \{t_{k-1} \le \tau_k < t_k\}$, in which $t_j$ and $\tau_j$ are stopping times defined by

$$t_j = \inf\{n : V_n \ge e_j\},$$

$$\tau_j = \inf\Big\{n \ge t_j : S_n - \sum_{i=1}^n \mu_i[-\lambda v_n, a_\lambda v_n) \ge (1 + 3\epsilon)b_\lambda V_n(\log\log V_n)^{1/2}\Big\}, \qquad (3.2)$$

letting $\inf\emptyset = \infty$. Note that for $i \le n$, the centering constants $\mu_i[-\lambda v_n, a_\lambda v_n)$ involve $v_n$ that is not determined until time $n$, so the centered sums that result do not usually form a martingale. However, sandwiching $\tau_k$ between $t_{k-1}$ and $t_k$ enables us to replace both the random exceedance and truncation levels in (3.2) by constants. Then the event $E_k$ can be re-expressed in terms of two simultaneous inequalities, one involving centered sums and the other involving a sum of squares. These inequalities combine to permit application of Lemma A.8 (see Appendix). Thereby we conclude that $P(E_k)$ can be upper-bounded by the probability of an event involving the supremum of a nonnegative supermartingale with mean $\le 1$, to which Doob's maximal inequality can be applied; see de la Peña, Klass and Lai (2004, pages 1924-1925) for details.
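The constant $b_\lambda$ is easy to evaluate numerically. The sketch below takes the defining equation to be $h - \log(1 + h) = \lambda^2$; the right-hand side is an assumption on our part (it is the choice consistent with the stated limit $b_\lambda \to \sqrt{2}$, because $h - \log(1 + h) \approx h^2/2$ for small $h$). The function names are hypothetical.

```python
import math


def h_of_lambda(lam, iterations=200):
    """Positive root of h - log(1 + h) = lam^2, found by bisection.

    The right-hand side lam^2 is an assumed reading of the text.
    """
    target = lam * lam
    lo, hi = 0.0, 1.0
    while hi - math.log1p(hi) < target:   # h - log(1 + h) is increasing for h > 0
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mid - math.log1p(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def b_lambda(lam):
    """b_lambda = h(lambda) / lambda."""
    return h_of_lambda(lam) / lam


for lam in (1.0, 0.1, 0.001):
    print(lam, b_lambda(lam))
```

Under this reading the printed values decrease toward $\sqrt{2} \approx 1.4142$ as $\lambda$ decreases, so a smaller truncation parameter $\lambda$ buys a sharper constant in (3.1).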

Although Theorem 3.2 gives an upper LIL for any adapted sequence of random variables $X_n$, the upper bound in (3.1) may not be attained. Example 6.4 of de la Peña, Klass and Lai (2004) suggests that one way to sharpen the bound is to center $X_n$ at its conditional median before applying Theorem 3.2 to $\tilde X_n = X_n - \mathrm{med}(X_n \mid \mathcal{F}_{n-1})$. On the other hand, if $X_n$ is a martingale difference sequence such that $|X_n| \le m_n$ a.s. for some $\mathcal{F}_{n-1}$-measurable $m_n$ with $m_n = o(v_n)$ and $v_n \to \infty$ a.s., then Theorem 6.1 of de la Peña, Klass and Lai (2004) shows that (3.1) is sharp in the sense that

$$\limsup \frac{S_n}{V_n(\log\log V_n)^{1/2}} = \sqrt{2} \text{ a.s.} \qquad (3.3)$$

The following example of de la Peña, Klass and Lai (2004) illustrates the difference between self-normalization by $V_n$ and by $\sigma_n$ in Theorems 3.2 and 3.1.

Example 3.1. Let $X_1 = X_2 = 0$, $X_3, X_4, \ldots$ be independent random variables such that

$$P\{X_n = -1/\sqrt{n}\} = 1/2 - n^{-1/2}(\log n)^{1/2} - n^{-1}(\log n)^{-2},$$

$$P\{X_n = -m_n\} = n^{-1}(\log n)^{-2}, \qquad P\{X_n = 1/\sqrt{n}\} = 1/2 + n^{-1/2}(\log n)^{1/2}$$

for $n \ge 3$, where $m_n \sim 2(\log n)^{5/2}$ is chosen so that $EX_n = 0$. Then $P\{X_n = -m_n \text{ i.o.}\} = 0$. Hence, with probability 1, $V_n^2 = \sum_{i=1}^n i^{-1} + O(1) = \log n + O(1)$ and it is shown in de la Peña, Klass and Lai (2004, page 1926) that

$$\frac{S_n}{V_n(\log\log V_n)^{1/2}} \sim \frac{4(\log n)^{3/2}}{3\{(\log n)(\log\log\log n)\}^{1/2}} \to \infty \text{ a.s.}, \qquad (3.4)$$

and that

$$\limsup \Big\{S_n - \sum_{i=1}^n EX_i1(|X_i| \le 1)\Big\}\Big/\{V_n(\log\log V_n)^{1/2}\} = \sqrt{2} \text{ a.s.} \qquad (3.5)$$

Note that $m_n/v_n \to \infty$ a.s. This shows that without requiring $m_n = o(v_n)$, the left-hand side of (3.3) may be infinite. On the other hand, $X_n$ is clearly bounded above and $\sigma_n^2 = \sum_{i=1}^n \mathrm{Var}(X_i) \sim 4\sum_{i=1}^n (\log i)^3/i \sim (\log n)^4$, yielding

$$\frac{S_n}{\sigma_n(\log\log\sigma_n)^{1/2}} \sim \frac{4(\log n)^{3/2}}{3(\log n)^2(\log\log\log n)^{1/2}} \to 0 \text{ a.s.}, \qquad (3.6)$$

which is consistent with Stout's (1973) upper LIL. Note, however, that $m_n/\sigma_n \sim 2(\log n)^{1/2} \to \infty$ and therefore one still does not have Stout's (1970) lower LIL in this example.
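The size of the rare large jump $m_n$ in Example 3.1 follows from the mean-zero constraint. Reading the three probabilities as $1/2 - n^{-1/2}(\log n)^{1/2} - n^{-1}(\log n)^{-2}$, $n^{-1}(\log n)^{-2}$ and $1/2 + n^{-1/2}(\log n)^{1/2}$ (our reading of the display), solving $EX_n = 0$ gives $m_n = 2(\log n)^{5/2} + n^{-1/2}$. The sketch below, with hypothetical function names, evaluates this for large $n$, where all three probabilities are nonnegative.

```python
import math


def jump_size(n):
    """m_n solving E X_n = 0 for the three-point distribution of Example 3.1."""
    log_n = math.log(n)
    p_jump = 1.0 / (n * log_n ** 2)                 # P{X_n = -m_n}
    p_plus = 0.5 + math.sqrt(log_n / n)             # P{X_n = +1/sqrt(n)}
    p_minus = 0.5 - math.sqrt(log_n / n) - p_jump   # P{X_n = -1/sqrt(n)}
    return (p_plus - p_minus) / (math.sqrt(n) * p_jump)


def mean_of_x(n):
    """E X_n under the same distribution, using m_n from jump_size."""
    log_n = math.log(n)
    p_jump = 1.0 / (n * log_n ** 2)
    p_plus = 0.5 + math.sqrt(log_n / n)
    p_minus = 0.5 - math.sqrt(log_n / n) - p_jump
    return (p_plus - p_minus) / math.sqrt(n) - jump_size(n) * p_jump


n = 10 ** 6
print(jump_size(n), 2.0 * math.log(n) ** 2.5)
```

The two printed values differ only by the lower-order term $n^{-1/2}$, which is the sense of $m_n \sim 2(\log n)^{5/2}$.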

4. Statistical applications 

Most of the probability theory of self-normalized processes developed in the last two decades is concerned with $A_t$ self-normalized by $B_t$ in the case $A_t = \sum_{i \le t} X_i$ is a sum of i.i.d. random vectors $X_i$ and $B_t = (\sum_{i \le t} X_iX_i')^{1/2}$, using a key property that

$$E\exp\{\theta'X_1 - \rho(\theta'X_1)^2\} < \infty \text{ for all } \theta \in \mathbb{R}^k \text{ and } \rho > 0, \qquad (4.1)$$

as observed by Chan and Lai (2000, pages 1646-1648). In the i.i.d. case, the finiteness of the moment generating function in (4.1) enables one to embed the underlying distribution in an exponential family and one can then use change of measures (exponential tilting) to derive saddlepoint approximations for large or moderate deviation probabilities of self-normalized sums or for more general boundary crossing probabilities. Specifically, letting $C_n = \sum_{i=1}^n X_iX_i'$ and $\psi(\theta, \rho)$ denote the logarithm of the left-hand side of (4.1), we have

$$E[\exp\{\theta'A_n - \rho\theta'C_n\theta - n\psi(\theta, \rho)\}] = 1. \qquad (4.2)$$

Let $P_{\theta,\rho}$ be the probability measure under which $(X_i, X_iX_i')$ are i.i.d. with density function $\exp(\theta'X_i - \rho\theta'X_iX_i'\theta - \psi(\theta, \rho))$ with respect to $P = P_{0,0}$. The random variable inside the square brackets of (4.2) is simply the likelihood ratio statistic based on $(X_1, \ldots, X_n)$, or the Radon-Nikodym derivative $dP^{(n)}_{\theta,\rho}/dP^{(n)}$, where the superscript $(n)$ denotes restriction of the measure to the $\sigma$-field generated by $\{X_1, \ldots, X_n\}$. For the case of dependent random vectors, although we no longer have the simple cumulant generating function $n\psi(\theta, \rho)$ in (4.2) to develop precise large or moderate deviation approximations, we can still derive exponential bounds by applying the pseudo-maximization technique to (2.2) or (2.13), which is weaker than (4.2), as shown in the following two examples from de la Peña, Klass and Lai (2004, pages 1921-1922) who use a multivariate normal distribution with mean 0 and covariance matrix $V^{-1}$ for the mixing distribution $F$ to generalize (2.8) to the multivariate setting.
Example 4.1. Let $\{d_n\}$ be a sequence of random vectors adapted to a filtration $\{\mathcal{F}_n\}$ such that $d_i$ is conditionally symmetric. Then for any $a > 1$ and any positive definite $k \times k$ matrix $V$,

$$P\left\{\frac{(\sum_{i=1}^n d_i)'(V + \sum_{i=1}^n d_id_i')^{-1}(\sum_{i=1}^n d_i)}{\log\det(V + \sum_{i=1}^n d_id_i') + 2\log(a/\sqrt{\det V})} \ge 1 \text{ for some } n \ge 1\right\} \le \frac{1}{a}.$$
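The statistic inside the probability of Example 4.1 is a quadratic form normalized by the inverse of a positive definite matrix. The following sketch computes it in dimension $k = 2$, reading the display as the ratio of $(\sum d_i)'(V + \sum d_id_i')^{-1}(\sum d_i)$ to $\log\det(V + \sum d_id_i') + 2\log(a/\sqrt{\det V})$. The function name, the restriction to $k = 2$ and the toy data are illustrative assumptions.

```python
import math


def self_normalized_ratio(ds, v, a):
    """Ratio from Example 4.1 for 2-dimensional vectors.

    ds: list of (x, y) pairs; v: symmetric positive definite 2x2 matrix
    given as nested lists; a: constant greater than 1.
    """
    s0 = sum(d[0] for d in ds)                       # components of sum d_i
    s1 = sum(d[1] for d in ds)
    m00 = v[0][0] + sum(d[0] * d[0] for d in ds)     # entries of V + sum d_i d_i'
    m01 = v[0][1] + sum(d[0] * d[1] for d in ds)
    m11 = v[1][1] + sum(d[1] * d[1] for d in ds)
    det_m = m00 * m11 - m01 * m01
    det_v = v[0][0] * v[1][1] - v[0][1] * v[1][0]
    # s' M^{-1} s, using the explicit inverse of a symmetric 2x2 matrix
    quad = (m11 * s0 * s0 - 2.0 * m01 * s0 * s1 + m00 * s1 * s1) / det_m
    threshold = math.log(det_m) + 2.0 * math.log(a / math.sqrt(det_v))
    return quad / threshold


identity = [[1.0, 0.0], [0.0, 1.0]]
print(self_normalized_ratio([(1.0, 0.0), (0.0, 1.0)], identity, a=2.0))
```

The denominator is positive whenever $a > 1$, because $\det(V + \sum d_id_i') \ge \det V$; the bound in the example says that the ratio reaches 1 at some $n$ with probability at most $1/a$.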

Example 4.2. Let $M_t$ be a continuous local martingale taking values in $\mathbb{R}^k$ such that $M_0 = 0$, $\lim_{t \to \infty} \lambda_{\min}(\langle M\rangle_t) = \infty$ a.s. and $E\exp(\theta'\langle M\rangle_t\theta) < \infty$ for all $\theta \in \mathbb{R}^k$ and $t > 0$. Then for any $a > 1$ and any positive definite $k \times k$ matrix $V$,

$$P\left\{\frac{M_t'(V + \langle M\rangle_t)^{-1}M_t}{\log\det(V + \langle M\rangle_t) + 2\log(a/\sqrt{\det V})} \ge 1 \text{ for some } t \ge 0\right\} = \frac{1}{a}. \qquad (4.3)$$

The reason why equality holds in (4.3) is that $\{\int f(\theta)\exp(\theta'M_t - \theta'\langle M\rangle_t\theta/2)\,d\theta,\ t \ge 0\}$ is a nonnegative continuous martingale to which an equality due to Robbins and Siegmund (1970) can be applied; see Corollary 4.3 of de la Peña, Klass and Lai (2004).

Self-normalization is ubiquitous in statistical applications, although $A_n$ and $C_n$ need no longer be linear functions of the observations $X_i$ and $X_iX_i'$ as in the $t$-statistic or Hotelling's $T^2$-statistic in the multivariate case. Section 4.1 gives an overview of self-normalization in statistical applications, and Section 4.2 discusses the connections of the pseudo-maximization approach with likelihood and Bayesian inference.

4.1. Self -normalization in statistical applications 

The t-statistic $\sqrt{n}(\bar{X}_n - \mu)/s_n$ is a special case of more general Studentized
statistics $(\hat{\theta}_n - \theta)/\widehat{se}_n$ that are of fundamental importance in statistical
inference on an unknown parameter $\theta$ of an underlying distribution from which the
sample observations $X_1, \ldots, X_n$ are drawn. In nonparametric inference, $\theta$ is a
functional $g(F)$ of the underlying distribution function $F$ and $\hat{\theta}_n$ is usually
chosen to be $g(\hat{F}_n)$, where $\hat{F}_n$ is the empirical distribution. The standard deviation
of $\hat{\theta}_n$ is often called its standard error, which is typically unknown, and $\widehat{se}_n$
denotes a consistent estimate of the standard error of $\hat{\theta}_n$. For the t-statistic, $\mu$
is the mean of $F$ and $\bar{X}_n$ is the mean of $\hat{F}_n$. Since $\mathrm{Var}(\bar{X}_n) = \mathrm{Var}(X_1)/n$, we
estimate the standard error of $\bar{X}_n$ by $s_n/\sqrt{n}$, where $s_n^2$ is the sample variance.
An important property of a Studentized statistic is that it is an approximate
pivot, which means that its distribution is approximately the same for all $\theta$; see
Efron and Tibshirani (1993, Section 12.5) who make use of this pivotal property
to construct bootstrap-t confidence intervals and tests. For parametric problems,
$\theta$ is usually a multidimensional vector and $\hat{\theta}_n$ is an asymptotically normal
estimate (e.g., by maximum likelihood). Moreover, the asymptotic covariance
matrix $\Sigma_n(\theta)$ of $\hat{\theta}_n$ depends on the unknown parameter $\theta$, so $\Sigma_n^{-1/2}(\hat{\theta}_n)(\hat{\theta}_n - \theta)$
is the self-normalized (Studentized) statistic that can be used as an approximate
pivot for tests and confidence regions.
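A small numerical illustration of why Studentizing helps: the t-statistic is unchanged when the data and the hypothesized mean are multiplied by the same positive constant, so its distribution cannot depend on the scale of the underlying distribution. The sample values and the helper name `t_statistic` below are hypothetical and chosen only for illustration.

```python
import math


def t_statistic(xs, mu):
    """sqrt(n) * (sample mean - mu) / s_n, with s_n^2 the sample variance."""
    n = len(xs)
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return math.sqrt(n) * (mean - mu) / math.sqrt(s2)


sample = [2.1, 3.4, 1.9, 5.0, 2.8, 3.3]
mu = 3.0
scale = 1000.0

t_original = t_statistic(sample, mu)
# Rescaling multiplies both the numerator and s_n by the same factor.
t_rescaled = t_statistic([scale * x for x in sample], scale * mu)
print(t_original, t_rescaled)
```

The two printed values agree up to floating-point rounding, because large observations enter both the numerator and the self-normalizing denominator.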

The theoretical basis for the approximate pivotal property of Studentized
statistics lies in the limiting standard normal distribution, or in some other
limiting distribution that does not involve $\theta$ (or $F$ in the nonparametric case). To
derive the asymptotic normality of $\hat{\theta}_n$, one often uses a martingale $M_n$ associated
with the data, and approximates $\Sigma_n^{-1/2}(\hat{\theta}_n)(\hat{\theta}_n - \theta)$ by $\langle M \rangle_n^{-1/2} M_n$. For example,
in the asymptotic theory of the maximum likelihood estimator $\hat{\theta}_n$, $\Sigma_n(\theta)$ is the
inverse of the observed Fisher information matrix $I_n(\theta)$, and the asymptotic
normality of $\hat{\theta}_n$ follows by using Taylor's theorem to derive

$$-I_n(\theta)(\hat{\theta}_n - \theta) \doteq \sum_{i=1}^n \nabla \log f_\theta(X_i \mid X_1, \ldots, X_{i-1}). \tag{4.4}$$

The right-hand side of (4.4) is a martingale whose predictable variation is
$-I_n(\theta)$. Therefore the Studentized statistic associated with the maximum
likelihood estimator can be approximated by a self-normalized martingale.



4.2. Pseudo-maximization in likelihood and Bayesian inference 

Let $X_1, \ldots, X_n$ be observations from a distribution with joint density function
$f_\theta(x_1, \ldots, x_n)$. Likelihood inference is based on the likelihood function $L_n(\theta) =
f_\theta(X_1, \ldots, X_n)$, whose maximization leads to the maximum likelihood estimator
$\hat{\theta}_n$. Bayesian inference is based on the posterior distribution of $\theta$, which is the
conditional distribution of $\theta$ given $X_1, \ldots, X_n$ when $\theta$ is assumed to have a
prior distribution with density function $\pi$. Under squared error loss, the Bayes
estimator $\tilde{\theta}_n$ is the mean of the posterior distribution whose density function is
proportional to $L_n(\theta)\pi(\theta)$, i.e.,

$$\tilde{\theta}_n = \int \theta\, \pi(\theta) L_n(\theta)\, d\theta \Big/ \int \pi(\theta) L_n(\theta)\, d\theta. \tag{4.5}$$

Recall that $L_n(\theta)$ is maximized at the maximum likelihood estimator $\hat{\theta}_n$.
Applying Laplace's asymptotic formula to the integrals in (4.5) shows that $\tilde{\theta}_n$ is
asymptotically equivalent to $\hat{\theta}_n$, so integrating over the posterior distribution
in (4.5) amounts to pseudo-maximization.
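The asymptotic equivalence of the Bayes and maximum likelihood estimators can be seen numerically in a toy case. The sketch below assumes i.i.d. Bernoulli observations and a uniform prior on (0, 1), both our choices for illustration, and evaluates the ratio of integrals in (4.5) by the midpoint rule; the name `bayes_estimate` is hypothetical.

```python
import math


def bayes_estimate(successes, n, grid=10000):
    """Posterior mean (4.5) for n Bernoulli trials under a uniform prior on (0, 1),
    approximated by the midpoint rule on a grid of theta values."""
    thetas = [(j + 0.5) / grid for j in range(grid)]
    log_lik = [successes * math.log(t) + (n - successes) * math.log(1.0 - t)
               for t in thetas]
    shift = max(log_lik)  # subtract the maximum to avoid underflow
    weights = [math.exp(v - shift) for v in log_lik]
    numerator = sum(t * w for t, w in zip(thetas, weights))
    return numerator / sum(weights)


for n in (10, 100, 1000):
    successes = 3 * n // 10
    mle = successes / n  # maximizer of the likelihood
    print(n, mle, bayes_estimate(successes, n))
```

In this conjugate case the posterior mean is known in closed form, (successes + 1)/(n + 2), so the gap to the maximum likelihood estimate shrinks at rate 1/n: averaging over the posterior behaves more and more like maximizing the likelihood.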

Let $\theta_0$ denote the true parameter value. A fundamental quantity in likelihood
inference is the likelihood ratio martingale

$$\frac{L_n(\theta)}{L_n(\theta_0)} = e^{\ell_n(\theta)}, \quad \text{where } \ell_n(\theta) = \sum_{i=1}^n \log \frac{f_\theta(X_i \mid X_1, \ldots, X_{i-1})}{f_{\theta_0}(X_i \mid X_1, \ldots, X_{i-1})}. \tag{4.6}$$

Note that $\nabla \ell_n(\theta_0)$ is also a martingale; it is the martingale in the right-hand side
of (4.4). Clearly $\int e^{\ell_n(\theta)} \pi(\theta)\, d\theta$ is also a martingale for any probability density
function $\pi$ of $\theta$. Lai (2004) shows how the pseudo-maximization approach can
be applied to $e^{\ell_n(\theta)}$ to derive boundary crossing probabilities for the generalized
likelihood ratio statistics $\ell_n(\hat{\theta}_n)$ that lead to efficient procedures in sequential
analysis.
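To make the comparison between mixing and maximizing concrete, consider a toy model of our own choosing: i.i.d. N(θ, 1) observations with θ_0 = 0, so that ℓ_n(θ) = θ S_n - nθ²/2 with S_n the sum of the observations, and a N(0, 1/v) density for π. The sketch below integrates the likelihood ratio numerically, compares the result with the closed form from the Gaussian integral, and sets both against the generalized likelihood ratio statistic S_n²/(2n). All function names and numerical values are hypothetical.

```python
import math


def log_mixture_numeric(s_n, n, v=1.0, half_width=10.0, grid=20000):
    """log of the integral of exp(l_n(theta)) * pi(theta), by the midpoint rule."""
    h = 2.0 * half_width / grid
    total = 0.0
    for j in range(grid):
        theta = -half_width + (j + 0.5) * h
        log_lr = theta * s_n - n * theta * theta / 2.0
        log_prior = 0.5 * math.log(v / (2.0 * math.pi)) - v * theta * theta / 2.0
        total += math.exp(log_lr + log_prior) * h
    return math.log(total)


def log_mixture_closed(s_n, n, v=1.0):
    """Closed form: S_n^2 / (2(n + v)) - (1/2) log((n + v) / v)."""
    return s_n * s_n / (2.0 * (n + v)) - 0.5 * math.log((n + v) / v)


def glr_statistic(s_n, n):
    """l_n evaluated at the maximum likelihood estimate S_n / n."""
    return s_n * s_n / (2.0 * n)


s_n, n = 12.0, 50
print(log_mixture_numeric(s_n, n), log_mixture_closed(s_n, n), glr_statistic(s_n, n))
```

The log mixture differs from the maximized log likelihood ratio by a logarithmic term and a slight shrinkage of the quadratic term, which is the sense in which integrating against π acts as a pseudo-maximization.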



Appendix 

Lemma A.1. Let $W_t$ be a standard Brownian motion. Assume that $T$ is a
stopping time such that $T < \infty$ a.s. Then

$$E \exp\{\lambda W_T - \lambda^2 T/2\} \le 1,$$

for all $\lambda \in \mathbb{R}$.

Lemma A.2. Let $M_t$ be a continuous, square-integrable martingale, with $M_0 =
0$. Then $\exp\{\lambda M_t - \lambda^2 \langle M \rangle_t/2\}$ is a supermartingale for all $\lambda \in \mathbb{R}$, and therefore

$$E \exp\{\lambda M_t - \lambda^2 \langle M \rangle_t/2\} \le 1.$$

If $M_t$ is only assumed to be a continuous local martingale, then the inequality is
also valid (by application of Fatou's lemma).

Lemma A.3. Let $\{M_t : t \ge 0\}$ be a locally square-integrable martingale, with
$M_0 = 0$. Let $\{V_t\}$ be an increasing process, which is adapted, purely
discontinuous and locally integrable; let $V^{(p)}$ be its dual predictable projection. Set
$X_t = M_t + V_t$, $C_t = \sum_{s \le t} ((\Delta X_s)^+)^2$, $D_t = \{\sum_{s \le t} ((\Delta X_s)^-)^2\}_t^{(p)}$, $H_t =
\langle M \rangle_t + C_t + D_t$. Then $\exp\{X_t - V_t^{(p)} - \frac{1}{2} H_t\}$ is a supermartingale and hence

$$E \exp\{\lambda (X_t - V_t^{(p)}) - \lambda^2 H_t/2\} \le 1 \quad \text{for all } \lambda \in \mathbb{R}.$$



Lemma A.2 follows from Proposition 3.5.12 of Karatzas and Shreve (1991).
Lemma A.3 is taken from Proposition 4.2.1 of Barlow, Jacka and Yor (1986).
The following lemma holds without any integrability conditions on the variables
involved. It is a generalization of the fact that if $X$ is any symmetric random
variable, then $A = X$ and $B = |X|$ satisfy the canonical condition (2.2); see
de la Peña (1999).



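Lemma A.4. Let $\{d_i\}$ be a sequence of variables adapted to an increasing
sequence of $\sigma$-fields $\{\mathcal{F}_i\}$. Assume that the $d_i$'s are conditionally symmetric
(i.e., $\mathcal{L}(d_i \mid \mathcal{F}_{i-1}) = \mathcal{L}(-d_i \mid \mathcal{F}_{i-1})$). Then $\exp\{\lambda \sum_{i=1}^n d_i - \lambda^2 \sum_{i=1}^n d_i^2/2\}$, $n \ge 1$, is
a supermartingale with mean $\le 1$, for all $\lambda \in \mathbb{R}$.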
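Lemma A.4 can be checked exactly in the simplest conditionally symmetric case, which we assume here for illustration: d_i = ε_i c_i with fixed magnitudes c_i and independent fair signs ε_i. The expectation then factors into terms cosh(λc_i) exp(-λ²c_i²/2), each at most 1 because cosh(x) ≤ exp(x²/2). The sketch below enumerates all sign patterns; the magnitudes and the name `expected_value` are hypothetical.

```python
import itertools
import math


def expected_value(magnitudes, lam):
    """E exp{lam * sum(d_i) - lam^2 * sum(d_i^2) / 2} for d_i = sign_i * c_i,
    computed exactly by averaging over all 2^n equally likely sign patterns."""
    n = len(magnitudes)
    sum_sq = sum(c * c for c in magnitudes)  # does not depend on the signs
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=n):
        s = sum(e * c for e, c in zip(signs, magnitudes))
        total += math.exp(lam * s - lam * lam * sum_sq / 2.0)
    return total / 2 ** n


magnitudes = [0.5, 1.0, 2.0, 0.3]
for lam in (-3.0, -1.0, 0.0, 0.5, 2.0):
    print(lam, expected_value(magnitudes, lam))
```

Every printed expectation is at most 1, with equality at λ = 0, and no moment condition on the magnitudes is needed.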

Note that any sequence of real-valued random variables $X_i$ can be "symmetrized"
to produce an exponential supermartingale by introducing random
variables $X_i'$ such that

$$\mathcal{L}(X_n' \mid X_1, X_1', \ldots, X_{n-1}, X_{n-1}', X_n) = \mathcal{L}(X_n \mid X_1, \ldots, X_{n-1})$$

and setting $d_n = X_n - X_n'$; see Section 6.1 of de la Peña and Giné (1999).



For the next four lemmas, Lemma A.5 is the tool used by Stout (1970, 1973)
to obtain the LIL for martingales, Lemma A.6 is de la Peña's (1999) extension
of Bernstein's inequality for sums of independent variables to martingales,
and Lemmas A.7 and A.8 correspond to Lemma 3.9(ii) and Corollary 5.3 of
de la Peña, Klass and Lai (2004).

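Lemma A.5. Let $\{d_n\}$ be a sequence of random variables adapted to an increasing
sequence of $\sigma$-fields $\{\mathcal{F}_n\}$ such that $E(d_n \mid \mathcal{F}_{n-1}) \le 0$ and $d_n \le M$ a.s. for all
$n$ and some nonrandom positive constant $M$. Let $0 < \lambda_0 \le M^{-1}$, $A_n = \sum_{i=1}^n d_i$,
$B_n^2 = (1 + \frac{1}{2}\lambda_0 M) \sum_{i=1}^n E(d_i^2 \mid \mathcal{F}_{i-1})$, $A_0 = B_0 = 0$. Then $\{\exp(\lambda A_n - \frac{1}{2}\lambda^2 B_n^2),
\mathcal{F}_n, n \ge 0\}$ is a supermartingale for every $0 \le \lambda \le \lambda_0$.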

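Lemma A.6. Let $\{d_n\}$ be a sequence of random variables adapted to an increasing
sequence of $\sigma$-fields $\{\mathcal{F}_n\}$ such that $E(d_n \mid \mathcal{F}_{n-1}) = 0$ and $\sigma_n^2 = E(d_n^2 \mid \mathcal{F}_{n-1}) <
\infty$. Assume that there exists a positive constant $M$ such that $E(|d_n|^k \mid \mathcal{F}_{n-1}) \le
(k!/2)\sigma_n^2 M^{k-2}$ a.s. or $P(|d_n| \le M \mid \mathcal{F}_{n-1}) = 1$ a.s. for all $n \ge 1$, $k > 2$. Let
$A_n = \sum_{i=1}^n d_i$, $V_n^2 = \sum_{i=1}^n E(d_i^2 \mid \mathcal{F}_{i-1})$, $A_0 = V_0 = 0$. Then $\{\exp(\lambda A_n -
\frac{1}{2(1 - M\lambda)}\lambda^2 V_n^2), \mathcal{F}_n, n \ge 0\}$ is a supermartingale for every $0 \le \lambda < 1/M$.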

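Lemma A.7. Let $\{d_n\}$ be a sequence of random variables adapted to an
increasing sequence of $\sigma$-fields $\{\mathcal{F}_n\}$ such that $E(d_n \mid \mathcal{F}_{n-1}) \le 0$ and $d_n \ge -M$
a.s. for all $n$ and some non-random positive constant $M$. Let $A_n = \sum_{i=1}^n d_i$,
$B_n^2 = 2C_\gamma \sum_{i=1}^n d_i^2$, $A_0 = B_0 = 0$, where $C_\gamma = -\{\gamma + \log(1 - \gamma)\}/\gamma^2$. Then
$\{\exp(\lambda A_n - \frac{1}{2}\lambda^2 B_n^2), \mathcal{F}_n, n \ge 0\}$ is a supermartingale for every $0 \le \lambda \le \gamma M^{-1}$.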
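One way to see why the constant C_γ is natural (this is our own reading, not an argument taken from the cited source) is the elementary inequality exp(x - C_γ x²) ≤ 1 + x for x ≥ -γ, with equality at x = -γ and x = 0. Applied with x = λd_n, which is at least -γ when 0 ≤ λ ≤ γ/M and d_n ≥ -M, it bounds each conditional expectation by 1 + λE(d_n | F_{n-1}) ≤ 1. The sketch below checks the inequality on a grid, assuming 0 < γ < 1; the function names and grid parameters are hypothetical.

```python
import math


def c_gamma(gamma):
    """C_gamma = -(gamma + log(1 - gamma)) / gamma^2, assuming 0 < gamma < 1."""
    return -(gamma + math.log(1.0 - gamma)) / gamma ** 2


def min_gap(gamma, upper=20.0, grid=20000):
    """Smallest value of (1 + x) - exp(x - C_gamma * x^2) on a grid of [-gamma, upper]."""
    c = c_gamma(gamma)
    h = (upper + gamma) / grid
    xs = (-gamma + j * h for j in range(grid + 1))
    return min((1.0 + x) - math.exp(x - c * x * x) for x in xs)


for gamma in (0.1, 0.5, 0.9):
    print(gamma, c_gamma(gamma), min_gap(gamma))
```

The minimum gap is zero up to rounding, attained at the left endpoint x = -γ, so the inequality is tight there; C_γ increases from 1/2 as γ moves from 0 toward 1.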

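Lemma A.8. Let $\{\mathcal{F}_n\}$ be an increasing sequence of $\sigma$-fields and $y_n$ be
$\mathcal{F}_n$-measurable random variables. Let $0 \le \gamma_n < 1$ and $0 < \lambda_n \le 1/C_{\gamma_n}$ be
$\mathcal{F}_{n-1}$-measurable random variables, with $C_\gamma$ given in Lemma A.7. Let $\mu_n = E\{y_n I(-\gamma_n \le
y_n < \lambda_n) \mid \mathcal{F}_{n-1}\}$. Then $\exp\{\sum_{i=1}^n (y_i - \mu_i - \lambda_i^{-1} y_i^2)\}$ is a supermartingale whose
expectation is $\le 1$.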